MS maintenance improvements #10417
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #10417 +/- ##
============================================
+ Coverage 16.17% 16.30% +0.12%
- Complexity 13291 13440 +149
============================================
Files 5668 5674 +6
Lines 498179 499203 +1024
Branches 60290 60364 +74
============================================
+ Hits 80581 81375 +794
- Misses 408578 408758 +180
- Partials 9020 9070 +50
|
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12497 |
|
@blueorangutan test |
|
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-12458)
|
…utdown is initiated [Any load balancer in the clustered environment can avoid routing requests to this MS node]
e8699b3 to 0f0f8e7
- block new agent connections during prepare for maintenance of ms
- maintain avoids ms list
- propagate updated management servers list and lb algorithm in host and indirect.agent.lb.algorithm settings respectively, to systemvm (non-routing) agents
- updated setup ms list and migrate agent connections to executor service
- migrate agent connection through executor, and send the answer to the ms host that initiated the migration
- re-initialize ssl handshake executor if it is shutdown
- don't allow prepare for maintenance or shutdown when other management server nodes are in preparing states
- don't allow trigger shutdown when management server is up and other management server nodes are in preparing states
- stop agent connections monitor on ms maintenance
- update avoid ms list in ready command
- updated connected host from the client connection
- update last agents in ms metrics from the database
- updated some agent config descriptions
- update last management server in the hosts during shutdown
- added agents and lastagents in management server response
- updated management server maintenance & shutdown unit tests
- some code improvements
0f0f8e7 to 9ef1c12
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12677 |
|
@blueorangutan test |
|
@blueorangutan test |
|
@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
Seems unrelated to the MS maintenance unit tests; will check (restarted it with debugging). |
@shwstppr fixed the test; the shutdown test was calling System.exit. |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12748 |
|
@blueorangutan test |
|
@sureshanaparti a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-12662)
|
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 12805 |
|
@blueorangutan test |
|
@rohityadavcloud a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests |
kiranchavala left a comment
LGTM, tested manually. The repro steps follow.
Issue 1
1. CloudStack should return HTTP 503 when the management server is in maintenance mode and an async API call is made
Before fix
1. Enable maintenance mode on the management server
2. Execute the API call; error 530 is returned
(localcloud) 🐱 > create volume diskofferingid=51a838c6-1428-4e6d-bbbb-4f774e062719 zoneid=1163fe4e-06f9-4763-bad8-47db23f8c875
🙈 Error: (HTTP 530, error code 4250) Maintenance or Shutdown has been initiated on this management server. Can not accept new jobs
After fix
(localcloud) 🐱 > create volume diskofferingid=2e64ecf3-58a5-4d0c-80cd-43ec93515b35 zoneid=3a42f982-fa53-4bbc-8843-6c3b4b4aaa6c
🙈 Error: (HTTP 503, error code 9999) Maintenance or Shutdown has been initiated on this management server. Can not accept new jobs
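The move from the non-standard 530 to the standard 503 matters mostly to whatever sits in front of the management servers, since 503 is a status that clients and load balancers commonly treat as "try another node". A minimal sketch of that idea, assuming a hypothetical client-side failover helper: the `send` callable, the node names and `call_with_failover` are illustrative and not part of CloudStack or cmk.

```python
from typing import Callable, Iterable, Tuple

SERVICE_UNAVAILABLE = 503  # standard HTTP status returned during maintenance/shutdown


def call_with_failover(
    nodes: Iterable[str],
    send: Callable[[str], Tuple[int, str]],
) -> Tuple[str, int, str]:
    """Try each management server in order, skipping nodes that answer 503.

    `send(node)` is a hypothetical transport hook returning (status, body).
    Returns (node, status, body) for the first node that does not answer 503.
    """
    skipped = []
    for node in nodes:
        status, body = send(node)
        if status == SERVICE_UNAVAILABLE:
            skipped.append(node)  # node is draining; try the next one
            continue
        return node, status, body
    raise RuntimeError(f"all management servers unavailable: {skipped}")


# Simulated cluster (assumption): mgmt1 is in maintenance, mgmt2 accepts the job.
def fake_send(node: str) -> Tuple[int, str]:
    if node == "mgmt1":
        return 503, "Maintenance or Shutdown has been initiated on this management server."
    return 200, "job accepted"


result = call_with_failover(["mgmt1", "mgmt2"], fake_send)
print(result)
```

With the old 530 status, a generic helper like this would have needed a CloudStack-specific special case.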
Issue 2
2. CloudStack should migrate the agents associated with a management server when maintenance mode is enabled on that server. The list managementservers and list managementserversmetrics API calls should include agents and lastagents in their output
Steps to verify the feature
- Have a CloudStack environment with multiple management servers
- Execute the following API calls
(localcloud) 🐱 > list managementservers filter=name,agents,lastagents
count = 3
managementserver:
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
| NAME | AGENTS | LASTAGENTS |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
| ref-trl-8103-k-mol8-kiran-chavala-mgmt1.sofia.shapeblue.com | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] |
| ref-trl-8103-k-mol8-kiran-chavala-mgmt2.sofia.shapeblue.com | [] | ["90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4"] |
| mgmt3.sofia.shapeblue.com | [] | [] |
(localcloud) 🐱 > list managementserversmetrics filter=name,agents,lastagents
managementserver:
count = 3
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
| NAME | AGENTS | LASTAGENTS |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
| ref-trl-8103-k-mol8-kiran-chavala-mgmt1.sofia.shapeblue.com | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] |
| ref-trl-8103-k-mol8-kiran-chavala-mgmt2.sofia.shapeblue.com | [] | ["90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4"] |
| mgmt3.sofia.shapeblue.com | [] | [] |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
- Execute the prepare-for-maintenance API call:
prepare formaintenance managementserverid=
- Execute the list managementservers and list managementserversmetrics API calls again; the agents have been migrated to the other management servers. Check the agents and lastagents outputs.
- The output of list managementserversmetrics is refreshed only after the interval set in the global setting management.server.stats.interval
(localcloud) 🐱 > list managementservers filter=name,agents,lastagents
count = 3
managementserver:
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NAME | AGENTS | LASTAGENTS |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ref-trl-8103-k-mol8-kiran-chavala-mgmt1.sofia.shapeblue.com | [] | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] |
| ref-trl-8103-k-mol8-kiran-chavala-mgmt2.sofia.shapeblue.com | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] | [] |
| mgmt3.sofia.shapeblue.com | [] | [] |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
(localcloud) 🐱 > list managementserversmetrics filter=name,agents,lastagents
managementserver:
count = 3
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NAME | AGENTS | LASTAGENTS |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ref-trl-8103-k-mol8-kiran-chavala-mgmt1.sofia.shapeblue.com | [] | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] |
| ref-trl-8103-k-mol8-kiran-chavala-mgmt2.sofia.shapeblue.com | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] | [] |
| mgmt3.sofia.shapeblue.com | [] | [] |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
(localcloud) 🐱 >
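The before/after comparison for Issue 2 can be automated. A minimal sketch, assuming the `list managementservers` output has already been parsed into dictionaries keyed by server name; the short names `mgmt1`/`mgmt2`/`mgmt3` and the helper `is_drained` are illustrative and not part of CloudStack. The agent UUIDs are the ones shown in the tables above.

```python
from typing import Dict, List

Snapshot = Dict[str, Dict[str, List[str]]]


def is_drained(before: Snapshot, after: Snapshot, ms: str) -> bool:
    """Check that every agent formerly connected to `ms` moved to another server."""
    moved = set(before[ms]["agents"])
    # The server under maintenance must no longer hold any agent connections.
    if after[ms]["agents"]:
        return False
    # Its previous agents should now be recorded as its lastagents...
    if not moved <= set(after[ms]["lastagents"]):
        return False
    # ...and each of them should be connected to some other server.
    elsewhere = {a for name, row in after.items() if name != ms for a in row["agents"]}
    return moved <= elsewhere


AGENTS = [
    "b2418798-c3d4-453b-b7a7-d3e20cdb5f3d",
    "90bd9ae9-b084-47ef-9415-650d4c1d2fa7",
    "a87cf0ff-f513-4936-ba80-60759bec4bb4",
    "24b86b9f-62a5-427c-92f4-fa8fd42e5135",
]
before = {
    "mgmt1": {"agents": AGENTS, "lastagents": [AGENTS[0], AGENTS[3]]},
    "mgmt2": {"agents": [], "lastagents": [AGENTS[1], AGENTS[2]]},
    "mgmt3": {"agents": [], "lastagents": []},
}
after = {
    "mgmt1": {"agents": [], "lastagents": AGENTS},
    "mgmt2": {"agents": AGENTS, "lastagents": []},
    "mgmt3": {"agents": [], "lastagents": []},
}
print(is_drained(before, after, "mgmt1"))  # True
```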
Issue 3
3. CloudStack should redistribute the agents associated with a management server based on the global settings indirect.agent.lb.algorithm and indirect.agent.lb.check.interval
Steps to verify the feature
- Have a CloudStack environment with multiple management servers and the agents connected to one management server
(localcloud) 🐱 > list managementservers filter=name,agents,lastagents
count = 3
managementserver:
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
| NAME | AGENTS | LASTAGENTS |
+-------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------+
| ref-trl-8103-k-mol8-kiran-chavala-mgmt1.sofia.shapeblue.com | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] |
| ref-trl-8103-k-mol8-kiran-chavala-mgmt2.sofia.shapeblue.com | [] | ["90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4"] |
| mgmt3.sofia.shapeblue.com | [] | [] |
- Change the value of the global setting “indirect.agent.lb.algorithm” from static to roundrobin or shuffle
- Execute the prepare-for-maintenance API call: prepare formaintenance managementserverid=
- The agents are distributed among the management servers after the interval set in indirect.agent.lb.check.interval
(localcloud) 🐱 > list managementservers filter=name,agents,lastagents
count = 3
managementserver:
+-------------------------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NAME | AGENTS | LASTAGENTS |
+-------------------------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ref-trl-8103-k-mol8-kiran-chavala-mgmt1.sofia.shapeblue.com | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] | ["b2418798-c3d4-453b-b7a7-d3e20cdb5f3d","90bd9ae9-b084-47ef-9415-650d4c1d2fa7","a87cf0ff-f513-4936-ba80-60759bec4bb4","24b86b9f-62a5-427c-92f4-fa8fd42e5135"] |
| ref-trl-8103-k-mol8-kiran-chavala-mgmt2.sofia.shapeblue.com | ["90bd9ae9-b084-47ef-9415-650d4c1d2fa7"] | [] |
| mgmt3.sofia.shapeblue.com | ["a87cf0ff-f513-4936-ba80-60759bec4bb4"] | [] |
+-------------------------------------------------------------+---------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
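How the three `indirect.agent.lb.algorithm` values differ can be illustrated with a small model. This is a sketch of the general idea only, not CloudStack's implementation: the function name, the use of a numeric host id for rotation, and the `avoid` parameter (modelling the avoid list mentioned in this PR) are all assumptions.

```python
import random
from typing import List, Optional, Sequence


def order_ms_list(
    ms_list: Sequence[str],
    algorithm: str,
    host_id: int = 0,
    avoid: Sequence[str] = (),
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return the management servers an agent should try, in preference order."""
    # Servers on the avoid list (e.g. preparing for maintenance) are dropped first.
    candidates = [ms for ms in ms_list if ms not in avoid]
    if not candidates or algorithm == "static":
        return candidates  # same order for every agent
    if algorithm == "roundrobin":
        shift = host_id % len(candidates)  # rotate by host id to spread agents
        return candidates[shift:] + candidates[:shift]
    if algorithm == "shuffle":
        shuffled = list(candidates)
        (rng or random.Random()).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown algorithm: {algorithm}")


servers = ["mgmt1", "mgmt2", "mgmt3"]
for host_id in range(4):
    print(host_id, order_ms_list(servers, "roundrobin", host_id))
print(order_ms_list(servers, "roundrobin", host_id=1, avoid=["mgmt1"]))
```

Under `static` every agent prefers the first server in the list, which matches the starting state above where all four agents sit on one node; rotating or shuffling the list per agent is what spreads them out.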
Issue 4
Events added for management server maintenance mode
|
[SF] Trillian test result (tid-12717)
|
|
Merging this based on the review & tests. |
* Update last agents during ms maintenance, and some code improvements
* Send 503 (Service Unavailable) response status when maintenance or shutdown is initiated [Any load balancer in the clustered environment can avoid routing requests to this MS node]
* Migrate systemvm agents before routing host agents, and some code improvements
* Added events for ms maintenance and shutdown operations
* Added the following ms maintenance and shutdown improvements
  - block new agent connections during prepare for maintenance of ms
  - maintain avoids ms list
  - propagate updated management servers list and lb algorithm in host and indirect.agent.lb.algorithm settings respectively, to systemvm (non-routing) agents
  - updated setup ms list and migrate agent connections to executor service
  - migrate agent connection through executor, and send the answer to the ms host that initiated the migration
  - re-initialize ssl handshake executor if it is shutdown
  - don't allow prepare for maintenance or shutdown when other management server nodes are in preparing states
  - don't allow trigger shutdown when management server is up and other management server nodes are in preparing states
  - stop agent connections monitor on ms maintenance
  - update avoid ms list in ready command
  - updated connected host from the client connection
  - update last agents in ms metrics from the database
  - updated some agent config descriptions
  - update last management server in the hosts during shutdown
  - added agents and lastagents in management server response
  - updated management server maintenance & shutdown unit tests
  - some code improvements
* refactored code / addressed comments
* removed shutdown testcase (maybe, calling System.exit)
* Revert "removed shutdown testcase (maybe, calling System.exit)"
  This reverts commit e14b071.
* avoid system.exit during shutdown test
* code improvements
* testcase fix
* Fix cutoff time in agent connections monitor thread
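Two of the rules in this commit message are guards on cluster state: maintenance or shutdown is refused while another node is still in a preparing state. A minimal sketch of such a guard, assuming hypothetical state names and a plain dictionary of node states; CloudStack's actual state enum and service classes are not shown in this conversation and may differ.

```python
from typing import Dict

# Hypothetical state names, chosen for illustration only.
PREPARING_STATES = {"PreparingForMaintenance", "PreparingForShutDown"}


def can_start_maintenance(target: str, states: Dict[str, str]) -> bool:
    """Allow maintenance/shutdown on `target` only if no other node is preparing."""
    return not any(
        state in PREPARING_STATES
        for name, state in states.items()
        if name != target
    )


cluster = {"mgmt1": "Up", "mgmt2": "PreparingForMaintenance", "mgmt3": "Up"}
print(can_start_maintenance("mgmt1", cluster))  # False: mgmt2 is still preparing
print(can_start_maintenance("mgmt2", cluster))  # True: no other node is preparing
```

The point of the guard is to keep at least one node accepting agent connections while another is draining.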

Description
This PR addresses the following improvements during MS maintenance
[Any load balancer in the clustered environment can avoid routing requests to this MS node]
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
Manually tested the changes.
503 Service Unavailable response =>
How did you try to break this feature and the system with this change?